Abstract


CST is a programming language based on Smalltalk-80 that supports concurrency using locks, asynchronous messages,
and distributed objects. In this paper, we describe CST: the language and its implementation. Example programs
and initial programming experience with CST are described. An implementation of CST generates native code for the
J-Machine, a fine-grained concurrent computer. Some novel compiler optimizations developed in conjunction with
that implementation are also described.


Introduction 


This paper describes CST, an object-oriented concurrent programming language based on Smalltalk-80 [7], and an
implementation of that language. CST adds three extensions to sequential Smalltalk. First, messages are
asynchronous. Several messages can be sent concurrently without waiting for a reply. Second, several methods may
access an object concurrently; locks are provided for concurrency control. Finally, CST allows the programmer
to describe distributed objects: objects with a single name but distributed state. They can be used to construct
abstractions for concurrency.

CST is being developed as part of the J-Machine project at MIT [4, 3]. The J-Machine is a fine-grain concurrent
computer. The primary building block in the J-Machine is the Message-Driven Processor (MDP). It efficiently
executes tasks with a grain size of 10 instructions and supports a global virtual address space. This machine requires
a programming system that allows programmers to concisely describe programs with method-level concurrency and
that facilitates the development of abstractions for concurrency.

Object-oriented  programming  meets  the  first  of  these  goals  by  introducing  a  discipline  into  message  passing.  Each 
expression  implies  a  message  send.  Each  message  invokes  a  new  process.  Each  receive  is  implicit.  The  global  address 
space  of  object  identifiers  eliminates  the  need  to  refer  to  node  numbers  and  process  IDs.  The  programmer  does  not 
have  to  insert  send  and  receive  statements  into  the  program,  keep  track  of  process  IDs,  and  perform  bookkeeping  to 
determine  which  objects  are  local  and  which  are  remote. 

For example, a CST program² that counts the number of leaves in a binary tree using double recursion is shown
in Figure 1. Nowhere in the program does the programmer explicitly specify a send or receive, and no node
numbers or process IDs are mentioned. Yet, as shown in Figure 1³, the program exhibits a great deal of concurrency.
Making message-passing implicit in the language simplifies programming and makes it easier to describe fine-grain
concurrency.

(class node (object) left right tree-node?)

(method node count-elements () ()
  (if tree-node? (+ (count-elements left)
                    (count-elements right))
      1))

Figure 1: A CST program that calculates the number of leaves in a tree using double recursion. Its concurrency
profile (active tasks in each message interval) is shown to the right.


CST facilitates the construction of concurrency abstractions by providing distributed objects: objects with a single
name whose state is distributed across the nodes of a concurrent computer. The one-to-many naming of distributed
objects, along with their ability to process many messages simultaneously, allows them to efficiently connect together
large numbers of objects. Distributing the name of a single distributed queue to sets of producer and consumer
objects, for example, connects many producers to many consumers without a bottleneck.

* The research described in this paper was supported in part by the Defense Advanced Research Projects Agency and monitored
by the Office of Naval Research under contracts N00014-88K-0738, N00014-87K-0825, and N00014-85-K-0124, in part by a National
Science Foundation Presidential Young Investigator Award with matching funds from General Electric Corporation, an Analog Devices
Fellowship, and an ONR Fellowship.

² This program is in prefix CST, a dialect that has a syntax resembling LISP. Infix CST [5] has a syntax closer to that of Smalltalk-80.
³ The concurrency profiles presented in this paper are produced by an Icode-level simulation of CST programs.

The Optimist compiler [8] compiles Concurrent Smalltalk to the assembly language of the Message-Driven Processor
(MDP) [9]. It includes many standard optimizations such as register variable assignment, dataflow analysis, copy
propagation, and dead code elimination [2, 13] that are used in compilers for conventional processors. Due to the
fine-grained parallel nature of the J-Machine, compiling for the MDP is unlike compiling for most conventional processors
in a few important aspects. For instance, loops are not important⁴, while minimizing code size, tail forwarding
methods, and efficiently and seamlessly handling parallelism are extremely important.

The development of Concurrent Smalltalk was motivated by dissatisfaction with process-based concurrent
programming using sends and receives [11]. Many of the ideas have been borrowed from actor languages [1]. Another language
named Concurrent Smalltalk has been developed at Keio University in Japan [14]. This language also allows message
sending to be asynchronous, but does not include the ability to describe distributed objects.


Concurrent  Smalltalk 


Top-Level  Expressions 


A  CST  program  consists  of  a  number  of  top-level  expressions.  Top  level  forms  include  declarations  of  program 
and  data  as  well  as  executable  expressions.  Linking  of  programs  (the  resolution  from  selectors  to  methods)  is  done 
dynamically. 

<top-exp> := (Global <global-variable> <value>) |
             (Constant <constant-name> <value>) |
             (Class <class-name> (<superclasses>) <instance-vars>) |
             (Method <class-name> <method-name>
                     (<formals>) (<locals>)
                     <expressions>) |
             <expression>


Globals and Constants  Global and constant declarations define names in the environment. These names are
visible in all programs, unless shadowed by an instance, argument, or local variable name. The global declaration
simply defines the name. Its value remains unbound. The constant declaration defines the name and binds the name
to the specified value.


⁴ In fact, the current version of Concurrent Smalltalk does not even have loops.

Classes  Objects are defined by specifying classes. Objects of a particular class have the same instance variables
and understand the same messages. A class may inherit variables and methods from one or more superclasses. For
example:


(class node (object) left right tree-node?)

defines a class, node, that inherits the properties of class object and adds three instance variables. This means
that methods for the class node can access all the instance variables of class object as well as those defined in their
own class definition. Methods defined for class object are also inherited. Of course, this inheritance is transitive,
so node actually inherits from a series of classes up through the top of the class hierarchy. Instance variables in the
class definition may hide (shadow) those defined in the superclasses if they have the same name. The same kind of
shadowing is allowed for selectors (method names).


Methods  The  behavior  of  a  class  of  objects  is  defined  in  terms  of  the  messages  they  understand.  For  each  message, 
a  method  is  executed.  That  execution  may  send  additional  messages,  modify  the  object  state,  modify  the  object 
behavior,  and  create  new  objects.  Methods  consist  of  a  header  and  a  body.  The  header  specifies  class,  selector, 
arguments,  and  locals.  The  body  consists  of  one  or  more  expressions.  For  example: 

(method node count-elements () ()
  (if tree-node? (+ (count-elements left)
                    (count-elements right))
      1))

defines a method for class node with selector count-elements. The two empty lists indicate that there are no explicit
arguments and no local variables. If present, the keyword reply sends the result of the following expression back
to the sender of the count-elements message. In this case, there is no reply keyword, so the method replies with
the value of the last expression. If the programmer wishes to suppress the reply, the (exit) form can be used; it
causes the method to terminate without a reply.

Messages are sent implicitly. Every expression conceptually involves sending a message to an object. Of course,
commonly occurring special cases, like adding two local integers, will be optimized to eliminate the send. For
example, (count-elements left) sends the message count-elements to left, and (+ x y) sends the message + with
argument y to object x. If both x and y are local integers, this operation can be optimized as an add instruction.

Each  expression  consists  of  a  selector,  a  receiver,  and  zero  or  more  arguments.  Identifiers  must  be  one  of:  constant, 
global  variable,  argument,  local  variable,  or  instance  variable.  Subexpressions  may  be  executed  concurrently  and  are 
sequenced  only  by  data  dependence.  For  example,  in  the  following  expression  from  the  program  in  Figure  1 

(+ (count-elements left) (count-elements right))

the two count-elements messages will be sent concurrently and the + message will be sent when both replies have
been received. The only way to serialize subexpression evaluation is to assign intermediate results to local variables.
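The data-dependence rule above can be illustrated outside CST. The following is a minimal Python sketch, not CST and not the authors' implementation: it models the two count-elements sends as asyncio tasks that are started together, and applies + only once both replies have arrived. The Node class, the build helper, and the choice of asyncio (which interleaves tasks rather than running them on separate processors) are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical stand-in for the CST node class: a leaf has no children.
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_tree_node(self) -> bool:
        return self.left is not None and self.right is not None


async def count_elements(node: Node) -> int:
    """Count leaves; both recursive 'sends' are issued before either reply is awaited."""
    if not node.is_tree_node:
        return 1
    # gather starts both subcomputations; the addition is sequenced only by
    # its data dependence on the two results.
    left_count, right_count = await asyncio.gather(
        count_elements(node.left), count_elements(node.right)
    )
    return left_count + right_count


def build(depth: int) -> Node:
    """Build a complete binary tree with 2**depth leaves (illustrative helper)."""
    if depth == 0:
        return Node()
    return Node(build(depth - 1), build(depth - 1))


if __name__ == "__main__":
    print(asyncio.run(count_elements(build(3))))
```

Serializing the two subexpressions would correspond to awaiting the left count into a local variable before starting the right one.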

A  complete  list  of  CST  expressions  is  shown  below: 


<exps> := <exp>+

<exp>  :=
          <name> |
          (<selector> <receiver-exp> <argument-exp>*) |
          (send <selector-exp> <receiver-exp> <argument-exp>*) |
          (value <exp>) |
          (set <name> <exp>) |
          (cast <name> <exp>) |
          <node> <selector> <receiver> <actuals>) |
          (forward <continuation> <selector> <receiver> <args>) |
          (reply <exp>) |
          (block (<formals>) (<locals>) <exps>) |
          (if <exp> <exp> <exp>) |
          (begin <exps>) |
          (exit)


An  Example  CST  Program 

We now introduce a slightly more complicated version of the program shown in Figure 1. Rather than simply counting
the leaves on a tree, we compute the lengths of all the lists linked to the tree and sum those lengths together.


(class node (object) left right)

(method node count-list-elements () ()
  (+ (count-list-elements left)
     (count-list-elements right)))

(class pair (object) car cdr)

(method pair count-list-elements () ()
  (length right 0))

(method pair length (n) ()
  (if (eq cdr 'nil) (+ 1 n)
      (length cdr (+ 1 n))))

Figure  2:  A  CST  program  that  computes  sum  of  list  lengths  and  its  execution  profile 




The node class definition is the same as it was in Figure 1. left and right are the children of the current node
in a binary tree. The right of each leaf node points to a linked list of pairs. The method count-list-elements
recursively counts the list lengths by doing so for the right subtree and the left subtree concurrently. At the bottom
of the tree, the late-binding SEND operation causes the count-list-elements method for pairs to be invoked. This
method computes the length of each list using the tail-recursive method length.


Distributed  Objects 


CST programs exhibit parallelism between objects; that is, many objects may be actively processing messages
simultaneously. However, ordinary objects can only receive one message at a time. CST relaxes this restriction with
Distributed Objects (DOs). Distributed objects are made up of multiple representatives (constituent objects) that
can each accept messages independently. The distributed object has a name (Distributed object ID or DID) and all
other objects send messages to this name when they wish to use the DO.

Messages sent to the DO are received by one and only one constituent object (CO). Which constituent receives
the message is left unspecified in the language. A clever implementation might send the messages to the closest
constituent whereas a simpler implementation might send the messages to a random constituent. The state of a
distributed object is typically distributed over the constituents. This means that responding to an external request
often requires the passing of messages amongst the constituents before replying. No locking is performed on the
distributed object as a whole. This means that the programmer must ensure the consistency of the distributed
object.


Support for Distributed Objects


CST includes two constructs to support distributed objects. For DO creation, we add an argument for the new selector:
the number of constituents desired in this DO. In order to pass messages within the object, each constituent object
must be able to address each of the other constituents. This is implemented with the special selector co. Each
distributed object can use this selector, the special instance variable group (a reference to the DO), and an index
to address any constituent. For example, (co group 5) refers to the 5th constituent of a distributed object. Each
constituent also has access to its own index and the number of constituents in the entire distributed object. Thus a
description of a distributed object might look something like the example shown in Figure 3.


;; Distributed Array Abstraction.
;; The constituents are spread throughout the machine.
;; The array state is allocated into equal size chunks on the constituents.
(class distarray (distobj) nr-elts chunk-size elt-array)

;; Given an uninitialized DO, init makes each one an array,
;; tells it how many elts it has, and how many elements
;; are in the entire array.
(method distarray init (arr-size) ()
  (do-1 self (block (constit elts) ()
               (co-init (co (group constit) (myindex constit)) elts)
               (reply constit))
        arr-size))

;; helper for init
(method distarray co-init (elts) ()
  (begin (set chunk-size (/ elts (+ 1 maxindex)))
         (set nr-elts elts)
         (set elt-array (new array chunk-size))
         ))

;; Tree recursive apply, with one argument
(method distarray do-1 (ablock arg1) () (ldo-1 (co group 0) ablock arg1))
(method distarray ldo-1 (ablock arg1) (a b lindex rindex)
  (set lindex (lindex self))
  (set rindex (rindex self))
  (cast a (if (<= lindex maxindex) (ldo-1 (co group lindex) ablock arg1)
              '()))
  (cast b (if (<= rindex maxindex) (ldo-1 (co group rindex) ablock arg1)
              '()))
  (touch a b)
  (reply (value ablock self arg1))
  (exit))

;; Select array element at index
(method distarray at (index) (selector)
  (if (or (< index (* chunk-size myindex))
          (>= index (* chunk-size (+ myindex 1))))
      (begin (set selector (truncate (/ index chunk-size)))
             (forward requester at (co group selector) index)
             (exit))
      (at elt-array (mod index chunk-size))))

;; Set array element at index to value
(method distarray at.put (index value) (selector)
  (if (or (< index (* chunk-size myindex))
          (>= index (* chunk-size (+ myindex 1))))
      (begin (set selector (truncate (/ index chunk-size)))
             (forward requester at.put (co group selector) index value)
             (exit))
      (at.put elt-array (mod index chunk-size) value)))

;; to make a distarray of 256 constituents and 1024 elements do
;; (init (new distarray 256) 1024)


Figure  3:  A  Distributed  Array  Example 


In the example of the distributed array, we would create a usable array in two steps. First we construct the
distributed object using the new form. The example in Figure 3 creates a distributed object with 256 constituents.
After the DO is created, we must initialize it in a way that is appropriate for the distributed array. We do so by sending
it an init message (also defined in Figure 3). This initialization sets each constituent up with a private array of the
appropriate number of elements. For example, if we wanted a distarray of 512 elements, in this case each constituent
would have a private array of two elements. This initialization is done in a tree recursive fashion and therefore takes
O(lg(n)) time.
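The arithmetic in this paragraph can be checked with a short Python sketch. It is illustrative only: the paper does not define lindex and rindex, so the sketch assumes a heap-style numbering in which constituent i passes the request to constituents 2i+1 and 2i+2, and the function names are hypothetical.

```python
def chunk_size(n_elements: int, n_constituents: int) -> int:
    """Elements held by each constituent when the array is split evenly."""
    return n_elements // n_constituents


def init_rounds(n_constituents: int) -> int:
    """Length of the longest forwarding chain in a tree-recursive broadcast.

    Assumption (not stated in the paper): constituent i forwards to
    constituents 2*i + 1 and 2*i + 2 when those indices exist.
    """
    max_index = n_constituents - 1

    def rounds_from(i: int) -> int:
        children = [c for c in (2 * i + 1, 2 * i + 2) if c <= max_index]
        return 1 + max((rounds_from(c) for c in children), default=0)

    return rounds_from(0)


if __name__ == "__main__":
    print(chunk_size(512, 256))   # elements per constituent
    print(init_rounds(256))       # sequential steps, which grows as lg(n)
```

Under this assumed numbering the chain length for 256 constituents is 9, that is floor(lg 256) + 1, even though all 256 constituents are visited.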

The mapping of the distarray elements onto the private arrays is done by the at and at.put methods. Each
constituent is responsible for a contiguous range of the distarray elements. Any requests received by a constituent
are first checked to see if they are within the local CO's jurisdiction. If they are not, they are forwarded to the
appropriate CO. If they are, the request is handled locally. This is a particularly simple example because each
constituent is wholly responsible for its subrange and need not negotiate with other constituents before modifying
its local state.
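The check-then-forward logic of at and at.put can be sketched in Python as follows. This is a sequential model written for illustration, not the CST runtime: the class and method names are hypothetical, forwarding is modeled as a direct method call, and the "unspecified" choice of receiving constituent is modeled with a seeded random pick.

```python
import random


class Constituent:
    """One constituent object holding a contiguous chunk of the array."""

    def __init__(self, group, my_index, chunk_size):
        self.group = group
        self.my_index = my_index
        self.chunk_size = chunk_size
        self.elts = [None] * chunk_size

    def owns(self, index):
        low = self.chunk_size * self.my_index
        return low <= index < low + self.chunk_size

    def at(self, index):
        if not self.owns(index):
            # Outside the local jurisdiction: forward to the owning constituent.
            return self.group.co(index // self.chunk_size).at(index)
        return self.elts[index % self.chunk_size]

    def at_put(self, index, value):
        if not self.owns(index):
            self.group.co(index // self.chunk_size).at_put(index, value)
        else:
            self.elts[index % self.chunk_size] = value


class DistArray:
    """A single name for many constituents (assumes an even split of elements)."""

    def __init__(self, n_constituents, n_elements, seed=0):
        chunk = n_elements // n_constituents
        self.constituents = [Constituent(self, i, chunk) for i in range(n_constituents)]
        self._rng = random.Random(seed)

    def co(self, i):
        return self.constituents[i]

    def _any_constituent(self):
        # The language leaves the receiving constituent unspecified.
        return self._rng.choice(self.constituents)

    def at(self, index):
        return self._any_constituent().at(index)

    def at_put(self, index, value):
        self._any_constituent().at_put(index, value)
```

Because every request is either handled locally or forwarded exactly once, a caller sees the same result whichever constituent first receives the message.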

Distributed objects are of great utility in building large objects on fine-grain machines. In the J-Machine, we restrict
ordinary objects to fit within the memory of a single node, thus restricting object size. With distributed objects, we
only require that a constituent of the DO fit on a single node. Some useful examples of distributed objects are
dictionaries, distributed arrays, sets, queues, and priority queues.


Experience  with  CST 


We have written a large number of Concurrent Smalltalk programs and executed them on our Icode simulator. These
programs include various data structures: distributed arrays, sets, rings, B-trees, grids, and matrices. They also
include several application kernels: N-body interaction and charged particle transport (particle-in-cell algorithm).
To date, the programs studied range from toys to applications of over 1000 lines. It is clear from our experience
that CST programs exhibit large amounts of parallelism. However, we are just beginning to exploit the potential of
Distributed Objects as building blocks for concurrent programs. We will continue to study data structures, algorithms,
and full-blown applications in our continuing evaluation of Concurrent Smalltalk.


The  Optimist  Compiler  for  CST 


Goals 

The main goal of the Optimist compiler is to produce Concurrent Smalltalk code that is as small as possible without
sacrificing speed. In almost all cases optimizations that reduce space also reduce execution time, but there are a few
cases in which the two conflict; in those cases the decisions were made in favor of optimizing space. Compilation speed
was not a major goal of the compiler project; simplicity and flexibility were considered more important. Still, the
compiler does achieve reasonable compilation speed, taking between one and fifteen seconds to compile most methods on a
2-megabyte Macintosh⁵ II using Coral Software's Allegro Common Lisp.


Organization 


The Optimist compiler comprises four phases, as shown in Figure 4. The Concurrent Smalltalk Front End can
be replaced by other front ends to compile other languages for the MDP. Also, the Icode can be extracted from two
places in the compilation process and either compiled onto different hardware or run on an Icode simulator.

The  source  code  is  converted  by  the  Front  End  into  an  intermediate  language  called  Icode.  The  Icode  is  at  a 
somewhat  higher  level  than  the  triples  or  quadruples  codes  that  most  compilers  use,  in  that  it  specifies  units  such 
as  entire  procedure  calls  in  single  instructions.  The  Icode  also  allows  for  the  possibility  of  having  more  than  one 
source  language  compile  into  MDP  assembly  language  code  or  having  the  same  source  language  compile  into  several 
assembly  languages.  Figure  5  shows  the  length  method  in  its  Icode  form. 

⁵ Macintosh is a trademark of Apple Computer, Inc.




[Block diagram. Recoverable labels: CST Source Code, Front End, Other Front Ends, I-Code, I-Code Simulator (at two
points), MDP Assembly Code.]


Figure  4:  Compiler  Organization. 


(CSEND (TEMP 0) (METHOD EQ) (IVAR 1) (CONST NIL))
(FALSEJUMP (TEMP 0) 0)
(CSEND (TEMP 1) (METHOD +) (CONST 1) (ARG 0))
(JUMP 1)
(LABEL 0)
(CSEND (TEMP 2) (METHOD +) (CONST 1) (ARG 0))
(CSEND (TEMP 1) (METHOD LENGTH) (IVAR 1) (TEMP 2))
(LABEL 1)
(RETURN (TEMP 1))


Figure 5: Icode for the length Method: The Icode output by the Front End is a literal translation of the source
code with few optimizations. At this point all method calls, including primitives, are compiled as CSENDs.


The Statement Analyzer and Optimizer processes and optimizes the Icode generated by the Front End. It performs
all of the compiler's optimizations that are relevant at the Icode level of abstraction. Internally it works with Icode
in the form of a directed control-flow graph. These optimizations include dead code elimination, move elimination,
dataflow transformations, constant folding, tail forwarding, and merging of identical statements on both
paths of a conditional. The optimizations are repeatedly attempted until none of them can improve the code.
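The "repeat until nothing improves" strategy can be sketched as a fixed-point loop. The Python below is a toy model under stated assumptions, not the Optimist's implementation: it works on straight-line code rather than a control-flow graph, every variable is assumed to be assigned once, and the tuple format ("MOVE", dst, src), ("ADD", dst, a, b), ("RETURN", src) is a hypothetical stand-in for Icode.

```python
def substitute(operand, consts):
    """Replace a variable name by its known constant value, if any."""
    return consts.get(operand, operand) if isinstance(operand, str) else operand


def propagate_constants(code):
    consts, out = {}, []
    for op, *args in code:
        if op == "RETURN":
            stmt = (op, substitute(args[0], consts))
        else:
            dst, *srcs = args
            stmt = (op, dst, *(substitute(s, consts) for s in srcs))
            if op == "MOVE" and isinstance(stmt[2], int):
                consts[dst] = stmt[2]  # valid because each variable is assigned once
        out.append(stmt)
    return out


def fold_constants(code):
    return [
        ("MOVE", s[1], s[2] + s[3])
        if s[0] == "ADD" and isinstance(s[2], int) and isinstance(s[3], int)
        else s
        for s in code
    ]


def eliminate_dead_code(code):
    live, kept = set(), []
    for stmt in reversed(code):
        op, *args = stmt
        if op == "RETURN":
            uses = args
        else:
            dst, *uses = args
            if dst not in live:
                continue  # result is never read: drop the statement
            live.discard(dst)
        live.update(u for u in uses if isinstance(u, str))
        kept.append(stmt)
    return kept[::-1]


def optimize(code):
    """Apply every pass repeatedly until a full round changes nothing."""
    passes = (propagate_constants, fold_constants, eliminate_dead_code)
    while True:
        before = code
        for run_pass in passes:
            code = run_pass(code)
        if code == before:
            return code


if __name__ == "__main__":
    example = [
        ("MOVE", "a", 1),
        ("ADD", "b", "a", 2),
        ("ADD", "c", "b", "b"),
        ("MOVE", "unused", 7),
        ("RETURN", "c"),
    ]
    print(optimize(example))
```

Iterating matters here because each pass exposes work for the others: folding b creates a new constant to propagate, which in turn lets c be folded and the remaining moves be removed.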

The Instruction Generator compiles each Icode statement to a number of quasi-MDP instructions and outputs the
MDP code in the form of a directed control-flow graph. At the same time, the Instruction Generator assigns variables
to either registers or memory locations and performs statement-specific optimizations on Icodes.


The Assembly Code Generator inserts branches into the directed graph of quasi-MDP instructions created by the
Instruction Generator and performs several peep-hole optimizations. The important optimizations include shifting
instructions wherever possible to align DC (Load Constant) instructions to word boundaries (all other
instructions need only be aligned at half-word boundaries) and combining SEND and SENDE instructions to SEND2 and
SEND2E. The Assembly Code Generator replaces short branches by long ones where necessary; such replacements
are complicated by the fact that long branches alter the value of the MDP's register R0. The Assembly Code Generator
outputs a file of assembly language statements which can be read, assembled, and executed by our MDP simulator
MDPSim [10]. Figure 6 contains the assembly code output for the sample method length.


        MODULE  PAIR_LENGTH
        DC      MSG:LoadCode+18
        DC      {{Class_PAIR},{Method_LENGTH}}

        MOVE    [2,A3],R0            ; 0
        XLATE   R0,A2,XLATE_OBJ      ; 0.5
        MOVE    1,R3                 ; 1
        ADD     R3,[3,A3],R2         ; 1.5
        MOVE    [3,A2],R1            ; 2
        BNNIL   R1,^L001             ; 2.5
        MOVE    [4,A3],R1            ; 3
        BNIL    R1,^L002             ; 3.5
        DC      MSG:ReplyConst+4     ; 4
        WTAG    R1,1,R3              ; 5
        LSH     R3,-16,R3            ; 5.5
        SEND2   R3,R0                ; 6
        SEND    R1                   ; 6.5
        SEND2E  [5,A3],R2            ; 7
        BR      ^L002                ; 7.5
L001:   MOVE    [3,A2],R0            ; 8
        CALL    Send.Boda.Ir         ; 8.5
        DC      MSG:SendConst+7      ; 9
        SEND2   R1,R0                ; 10
        DC      {Method_LENGTH}      ; 11
        SEND    R0                   ; 12
        SEND2   [3,A2],R2            ; 12.5
        SEND    [4,A3]               ; 13
        SENDE   [6,A3]               ; 13.5
L002:   SUSPEND                      ; 14
        END


Figure 6: Final Output of the Compiler: This is the MDP assembly code into which the length method compiles. If
the optimizations were turned off, the code size would have been 32 words, more than twice the size of the optimized
code.


Optimizations 

Tail Forwarder  The tail forwarder performs the message-passing equivalent of tail recursion. It is often the case
that the value returned by a Concurrent Smalltalk method is the value returned by the last statement of that method,
and that statement is often a method call. An example of this phenomenon is the recursive definition of the length
method in Figure 2.

If cdr is not equal to nil, the length method makes a recursive call and, when that call returns, it immediately
returns that value as the result. There is, however, no fundamental reason why length should wait for the result
of the recursive call to length only to return it to the caller; on the contrary, it would be better if the recursive
length call returned its result to the initial caller. length optimized this way runs in constant space instead of space
proportional to the list length. The Tail Forwarder performs this optimization by looking for a CSEND statement
whose value is returned by a REPLY statement immediately afterwards. Such a CSEND statement is modified to
inform the callee to return its result to this method's caller instead of this method.
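The pattern match described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the Optimist's code: it scans a flat statement list instead of a control-flow graph, the tuple layout ("CSEND", dst, selector, receiver, *args) and ("REPLY", src) is hypothetical, the rewritten statement is given the made-up name FORWARD, and the destination temporary is assumed to have no other uses.

```python
def tail_forward(code):
    """Rewrite 'CSEND into t; REPLY t' pairs so the callee replies to our caller."""
    out = []
    i = 0
    while i < len(code):
        stmt = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if (
            stmt[0] == "CSEND"
            and nxt is not None
            and nxt[0] == "REPLY"
            and nxt[1] == stmt[1]
        ):
            # Drop the destination and the REPLY: the send now carries the
            # current method's reply address instead of waiting for a value.
            out.append(("FORWARD", *stmt[2:]))
            i += 2
        else:
            out.append(stmt)
            i += 1
    return out


if __name__ == "__main__":
    # Recursive branch of the length method from Figure 2, in the toy format.
    recursive_branch = [
        ("CSEND", "t2", "+", 1, "n"),
        ("CSEND", "t1", "length", "cdr", "t2"),
        ("REPLY", "t1"),
    ]
    print(tail_forward(recursive_branch))
```

After the rewrite the method no longer holds any state while the recursive call runs, which is what lets length execute in constant space.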


Fork and Join Mergers  These two optimizations, if they can be applied, often produce significant savings in the
output code size. They try to consolidate similar statements on both sides of forks (conditionals) and joins (places
where two paths of control flow merge) in the control-flow graph.

The Join Merger looks for similar statements immediately preceding each join in the control-flow graph. Here two
statements are considered to be similar if they are identical or if they are both CSENDs with identical targets and
the same number of arguments; the arguments themselves need not be the same. The Join Merger moves both
statements after the join; if the statements were not identical, MOVEs are generated to copy any differing arguments
into temporaries before the join; the combined statement after the join will use the temporaries instead of the original
arguments. These MOVEs are usually later removed by the Move Eliminator. Although more than two paths of
control flow can join at the same place, the Join Merger only considers them pairwise; if more than two paths can
be merged, initially two will be merged, with the other ones considered in a later pass. The Fork Merger operates
analogously except that it also has to be sure not to affect the value of the condition determining which branch the
program will take.

The Join Merger occasionally merges two completely different method calls which happen to have the same number
of arguments, but which may even call different methods (the method selector is treated as an argument like any
other), a rather unexpected optimization indeed. In each branch just before the join, the resulting object code copies
the differing method arguments into the MDP's registers and stores the appropriate method selector in a register.
After the join is common code that sends the message given the method selector and arguments in the registers. Since
the code to send a message is long compared to the code to load values into registers, the optimization has a net
savings of five words (ten instructions) of code without significantly affecting the running time.
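A simplified version of the join merge can be written in Python. The sketch below makes several assumptions that are not from the paper: each branch is a plain list of statements ending at the join, "identical targets" is read as "same destination temporary", statements use the hypothetical layout ("CSEND", dst, selector, receiver, *args), and fresh temporaries are named m0, m1, and so on.

```python
from itertools import count


def merge_at_join(branch_a, branch_b, after_join, fresh=None):
    """Merge the last statements of two branches into one statement after the join.

    Returns (new_branch_a, new_branch_b, new_after_join); the inputs are unchanged.
    """
    if fresh is None:
        fresh = (f"m{i}" for i in count())
    if not branch_a or not branch_b:
        return branch_a, branch_b, after_join
    a, b = branch_a[-1], branch_b[-1]
    if a == b:
        # Identical statements: simply hoist one copy below the join.
        return branch_a[:-1], branch_b[:-1], [a] + after_join
    similar = (
        a[0] == "CSEND" and b[0] == "CSEND" and a[1] == b[1] and len(a) == len(b)
    )
    if not similar:
        return branch_a, branch_b, after_join
    new_a, new_b = list(branch_a[:-1]), list(branch_b[:-1])
    merged = [a[0], a[1]]
    for x, y in zip(a[2:], b[2:]):
        if x == y:
            merged.append(x)
        else:
            # Differing operand (the selector counts as one): copy it into a
            # shared temporary on each path before the join.
            temp = next(fresh)
            new_a.append(("MOVE", temp, x))
            new_b.append(("MOVE", temp, y))
            merged.append(temp)
    return new_a, new_b, [tuple(merged)] + after_join


if __name__ == "__main__":
    # The two arms of the length method (Figure 5) in the toy format.
    then_arm = [("CSEND", "t1", "+", 1, "n")]
    else_arm = [("CSEND", "t2", "+", 1, "n"), ("CSEND", "t1", "length", "cdr", "t2")]
    for part in merge_at_join(then_arm, else_arm, [("REPLY", "t1")]):
        print(part)
```

In this example the + send and the length send are merged even though they invoke different methods, which mirrors the unexpected case described above; the generated MOVEs are the ones a later move-elimination pass would try to remove.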


Move  Eliminator  For  each  MOVE  statement  from  a  local  variable  to  another  local  variable,  the  Move  Eliminator 
attempts  to  merge  the  source  and  destination  variables  into  one  variable  and  then  remove  the  MOVE  statement. 
Such  a  merge  can  be  done  successfully  if  the  two  variables  are  never  simultaneously  live  at  any  point  in  the  code. 
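The liveness condition can be made concrete with a small Python sketch. It is a toy model and not the Optimist's algorithm: it handles only straight-line code, uses a hypothetical tuple format (("MOVE", dst, src), ("ADD", dst, a, b), ("RETURN", src)), and recomputes the interference information after every merge for simplicity.

```python
def interference(code):
    """Pairs of variables that are live at the same time (straight-line code only)."""
    live, edges = set(), set()
    for op, *args in reversed(code):
        if op == "RETURN":
            live.update(a for a in args if isinstance(a, str))
            continue
        dst, *srcs = args
        for v in live:
            # A MOVE does not make its own source and destination interfere.
            if v != dst and not (op == "MOVE" and v == srcs[0]):
                edges.add(frozenset((dst, v)))
        live.discard(dst)
        live.update(s for s in srcs if isinstance(s, str))
    return edges


def rename(code, old, new):
    """Rename a variable in every operand position (opcodes are left alone)."""
    return [(s[0], *(new if a == old else a for a in s[1:])) for s in code]


def eliminate_moves(code):
    """Merge source and destination of variable-to-variable MOVEs when safe."""
    while True:
        edges = interference(code)
        for stmt in code:
            if (
                stmt[0] == "MOVE"
                and isinstance(stmt[2], str)
                and stmt[1] != stmt[2]
                and frozenset(stmt[1:3]) not in edges
            ):
                merged = rename(code, stmt[1], stmt[2])
                # The MOVE has become 'x <- x' and can be dropped.
                code = [s for s in merged if not (s[0] == "MOVE" and s[1] == s[2])]
                break
        else:
            return code


if __name__ == "__main__":
    removable = [("ADD", "b", "x", 1), ("MOVE", "a", "b"), ("ADD", "c", "a", "a"), ("RETURN", "c")]
    print(eliminate_moves(removable))
    blocked = [("MOVE", "a", "b"), ("ADD", "b", "b", 1), ("ADD", "c", "a", "b"), ("RETURN", "c")]
    print(eliminate_moves(blocked))
```

In the second example a and b are both live after b is redefined, so the two variables cannot share storage and the MOVE is kept.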

The  Move  Eliminator  complements  the  copy-propagation  algorithm  in  the  Optimist.  Although  both  try  to  optimize 
MOVE  statements,  each  is  able  to  handle  cases  that  the  other  cannot.  The  copy  propagation  can  handle  constants, 
while  Figure  7  shows  an  example  of  MOVE  statements  that  can  be  eliminated  by  the  Move  Eliminator  but  not  by 
copy  propagation. 


Figure 7: Move Eliminator Example: The Move Eliminator is able to remove the two MOVE statements (a ← b) and
(a ← c) in the above code (the arrows indicate possible flow of control paths). The copy propagation algorithm would
not detect the opportunity to remove these two MOVE statements because the value of a at the return statement is
neither a copy of b nor a copy of c. The above code does occur in many methods.


Variable Allocator  A greedy algorithm is used to assign eligible variables to registers. The shortest-lived variables
with the most references are considered first. A graph coloring algorithm is used to assign the variables that did not
fit in the registers to context slots; thus fewer context slots are used, saving valuable memory space.
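A rough Python sketch of this two-stage allocation follows. The paper does not give the exact priority function or coloring heuristic, so several details are assumptions: variables are ordered by lifetime (shortest first) and then by reference count (most first), interference is supplied as a precomputed dictionary, and context slots are colored with a simple first-fit rule.

```python
def first_free(taken):
    """Smallest non-negative integer not in 'taken'."""
    i = 0
    while i in taken:
        i += 1
    return i


def allocate(variables, interferes, n_registers):
    """Assign variables to registers greedily, then color the rest into context slots.

    variables:  name -> (lifetime, references)
    interferes: name -> set of names that are live at the same time
    """
    # Assumed priority: short lifetime first, then more references first.
    order = sorted(variables, key=lambda v: (variables[v][0], -variables[v][1], v))
    registers, spilled = {}, []
    for v in order:
        taken = {registers[u] for u in interferes.get(v, ()) if u in registers}
        reg = first_free(taken)
        if reg < n_registers:
            registers[v] = reg
        else:
            spilled.append(v)
    slots = {}
    for v in spilled:
        # First-fit coloring: variables that never interfere share a slot.
        taken = {slots[u] for u in interferes.get(v, ()) if u in slots}
        slots[v] = first_free(taken)
    return registers, slots


if __name__ == "__main__":
    variables = {"a": (2, 3), "b": (5, 2), "c": (9, 1), "d": (9, 1)}
    interferes = {
        "a": {"b", "c", "d"},
        "b": {"a", "c", "d"},
        "c": {"a", "b"},
        "d": {"a", "b"},
    }
    print(allocate(variables, interferes, n_registers=2))
```

Here c and d do not fit in the two registers, but because they are never live together they share a single context slot instead of using two.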

Summary

In this paper, we have presented a new language, Concurrent Smalltalk, that is designed for concurrency. Specific
support for concurrency includes locks, distributed objects, and asynchronous message passing.

Distributed Objects represent a significant innovation in programming parallel machines. We refer to the constituents
of a distributed object with a single name, but the object is implemented by many constituents. This
different perspective allows easy use of distributed objects by outside programs while allowing the exploitation of
internal concurrency.

We have described an implementation of a CST system. This programming environment includes a compiler,
simulator, and statistics collection package. This set of tools allows us to experiment with new constructs and
implementation techniques for the language. Although many of the optimizations used by the Optimist compiler are generally
known, they have usually been applied to compilers for conventional processors. The issues involved in compiling
for the MDP are quite different from compiling for conventional processors. After examining the compiler's output,
it becomes apparent that the optimizations are essential to the successful use of Concurrent Smalltalk on the MDP.
The compiler's optimizations reduce the amount of code output by anywhere between 20% and 60% (or even more
in some cases) compared to output with all nonessential optimizations disabled. Such a reduction is very important
on a processor with only 4096 words of primary memory.

There are many open issues relating to CST and similar programming systems. Key efficiency issues remain
unresolved: how fine grain will the programs written in CST be, and what is the run time overhead of CST programs?
There are also concerns about the expressive power of languages like CST: how easy is it to write programs in CST,
and how useful are distributed objects?